Overview

  1. Project Instructions & Prerequisites
  2. Learning Objectives
  3. Data Preparation
  4. Create Categorical Features with TF Feature Columns
  5. Create Continuous/Numerical Features with TF Feature Columns
  6. Build Deep Learning Regression Model with Sequential API and TF Probability Layers
  7. Evaluating Potential Model Biases with Aequitas Toolkit

1. Project Instructions & Prerequisites

Project Instructions

Context: EHR data is becoming a key source of real-world evidence (RWE) for the pharmaceutical industry and regulators to make decisions on clinical trials. You are a data scientist for an exciting unicorn healthcare startup that has created a groundbreaking diabetes drug that is ready for clinical trial testing. It is a very unique and sensitive drug that requires administering over at least 5-7 days in the hospital, with frequent monitoring/testing and patient medication adherence training via a mobile application. You have been provided a patient dataset from a client partner and are tasked with building a predictive model that can identify which types of patients the company should focus its testing efforts on. Target patients are people who are likely to be in the hospital for this duration of time and who will not incur significant additional costs for drug administration and monitoring.

In order to achieve your goal you must build a regression model that can predict the estimated hospitalization time for a patient and use this to select/filter patients for your study.

Expected Hospitalization Time Regression Model: Utilizing a synthetic dataset (denormalized to the line level) built off of the UCI Diabetes readmission dataset, students will build a regression model that predicts the expected days of hospitalization time and then convert this to a binary prediction of whether to include or exclude that patient from the clinical trial.

This project will demonstrate the importance of building the right data representation at the encounter level, with appropriate filtering and preprocessing/feature engineering of key medical code sets. This project will also require students to analyze and interpret their model for biases across key demographic groups.

Please see the project rubric online for more details on the areas your project will be evaluated.

Dataset

Due to healthcare PHI regulations (HIPAA, HITECH), there is a limited number of publicly available datasets, and some datasets require training and approval. So, for the purpose of this exercise, we are using a dataset from UC Irvine (https://archive.ics.uci.edu/ml/datasets/Diabetes+130-US+hospitals+for+years+1999-2008) that has been modified for this course. Please note that it is limited in its representation of some key features, such as diagnosis codes, which are usually an unordered list in 835s/837s (the HL7 standard interchange formats used for claims and remits).

Data Schema: The dataset reference information can be found at https://github.com/udacity/nd320-c1-emr-data-starter/blob/master/project/data_schema_references/ . There are two CSVs that provide more details on the fields and some of the mapped values.

Project Submission

When submitting this project, make sure to run all the cells before saving the notebook. Save the notebook file as "student_project_submission.ipynb" and save another copy as an HTML file by clicking "File" -> "Download as..." -> "html". Include the "utils.py" and "student_utils.py" files in your submission. student_utils.py is where you should put most of the code that you write; the summary and text explanations should be written inline in the notebook. Once you download these files, compress them into one zip file for submission.

Prerequisites

  • Intermediate level knowledge of Python
  • Basic knowledge of probability and statistics
  • Basic knowledge of machine learning concepts
  • Installation of Tensorflow 2.0 and other dependencies (conda environment.yml or virtualenv requirements.txt file provided)

Environment Setup

For step by step instructions on creating your environment, please go to https://github.com/udacity/nd320-c1-emr-data-starter/blob/master/project/README.md.

2. Learning Objectives

By the end of the project, you will be able to:

  • Use the Tensorflow Dataset API to scalably extract, transform, and load datasets, and build datasets aggregated at the line, encounter, and patient (longitudinal) levels
  • Analyze EHR datasets to check for common issues (data leakage, statistical properties, missing values, high cardinality) by performing exploratory data analysis.
  • Create categorical features from Key Industry Code Sets (ICD, CPT, NDC) and reduce dimensionality for high-cardinality features by using embeddings
  • Create derived features (bucketing, cross-features, embeddings) utilizing Tensorflow feature columns on both continuous and categorical input features
  • Use the Tensorflow Probability library to train a model that provides uncertainty range predictions, allowing for risk adjustment/prioritization and triaging of predictions
  • Analyze and determine biases for a model for key demographic groups by evaluating performance metrics across groups by using the Aequitas framework

3. Data Preparation

In [1]:
# install packages
import sys
import pkg_resources
from IPython.display import display, Markdown, Latex, clear_output, display_html

# check if plotly exists in the library 
# if not, install the package & restart the notebook
if len([True for d in pkg_resources.working_set if d.project_name in ['plotly']]) == 0:
    
    # install plotly
    !{sys.executable} -m pip install plotly  # you may want to restart your kernel if it can't find the plotly package
                                             # after installing it
    
    # restart the notebook
    clear_output()
    display(Markdown('<br/><span style="color: #dd0000; font-size: 34px; font-weight: 700;">Re-run your cells!<br/> The notebook has been restarted!</span>'))
    display_html("<script>Jupyter.notebook.kernel.restart()</script>",raw=True)
In [2]:
# from __future__ import absolute_import, division, print_function, unicode_literals
import os
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers
import tensorflow_probability as tfp
import matplotlib.pyplot as plt
import pandas as pd
import aequitas as ae
from IPython.display import display, Markdown, Latex, clear_output

import plotly.graph_objects as go
from plotly.subplots import make_subplots
pd.options.plotting.backend = "plotly"

# Put all of the helper functions in utils
from utils import build_vocab_files, show_group_stats_viz, aggregate_dataset, preprocess_df, df_to_dataset, posterior_mean_field, prior_trainable
pd.set_option('display.max_columns', 500)

# this allows you to make changes and save in student_utils.py and the file is reloaded every time you run a code block
%load_ext autoreload
%autoreload 2
In [3]:
#OPEN ISSUE ON MAC OSX for TF model training
import os
os.environ['KMP_DUPLICATE_LIB_OK']='True'

Dataset Loading and Schema Review

Load the dataset and view a sample of the dataset along with reviewing the schema reference files to gain a deeper understanding of the dataset. The dataset is located at the following path https://github.com/udacity/nd320-c1-emr-data-starter/blob/master/project/starter_code/data/final_project_dataset.csv. Also, review the information found in the data schema https://github.com/udacity/nd320-c1-emr-data-starter/blob/master/project/data_schema_references/

In [4]:
dataset_path = "./data/final_project_dataset.csv"

#
# Add ?, '?|?', 'None' as not a number
#
df = pd.read_csv(dataset_path, na_values=[ '', ' ', '?', '?|?','None', '-NaN', '-nan', '<NA>', 'N/A', 'NA', 'NULL', 'NaN', 'n/a', 'nan', 'null'])
df.head()
Out[4]:
encounter_id patient_nbr race gender age weight admission_type_id discharge_disposition_id admission_source_id time_in_hospital payer_code medical_specialty primary_diagnosis_code other_diagnosis_codes number_outpatient number_inpatient number_emergency num_lab_procedures number_diagnoses num_medications num_procedures ndc_code max_glu_serum A1Cresult change readmitted
0 2278392 8222157 Caucasian Female [0-10) NaN 6 25 1 1 NaN Pediatrics-Endocrinology 250.83 NaN 0 0 0 41 1 1 0 NaN NaN NaN No NO
1 149190 55629189 Caucasian Female [10-20) NaN 1 1 7 3 NaN NaN 276 250.01|255 0 0 0 59 9 18 0 68071-1701 NaN NaN Ch >30
2 64410 86047875 AfricanAmerican Female [20-30) NaN 1 1 7 2 NaN NaN 648 250|V27 2 1 0 11 6 13 5 0378-1110 NaN NaN No NO
3 500364 82442376 Caucasian Male [30-40) NaN 1 1 7 2 NaN NaN 8 250.43|403 0 0 0 44 7 16 1 68071-1701 NaN NaN Ch NO
4 16680 42519267 Caucasian Male [40-50) NaN 1 1 7 1 NaN NaN 197 157|250 0 0 0 51 5 8 0 0049-4110 NaN NaN Ch NO

Determine Level of Dataset (Line or Encounter)

Question 1: Based off of analysis of the data, what level is this dataset? Is it at the line or encounter level? Are there any key fields besides the encounter_id and patient_nbr fields that we should use to aggregate on? Knowing this information will help inform us what level of aggregation is necessary for future steps and is a step that is often overlooked.

In [5]:
txt = f'<b>Number of Rows: </b>{df.shape[0]}  \n'
txt += f'<b>Number of Patients: </b>{df["patient_nbr"].nunique()}  \n'
txt += f'<b>Number of Encounter IDs: </b>{df["encounter_id"].nunique()}  \n'

txt += '\n<b>Line or Encounter: </b>'

if   ('encounter_id' in list(df)) and (len(df) > df['encounter_id'].nunique()):
    txt += "Dataset could be at the **Line Level**"

elif ('encounter_id' in list(df)) and (len(df) == df['encounter_id'].nunique()):
    txt += "Dataset could be at the **Encounter Level**"

else:
    txt += "Dataset could be at the **Longitudinal Level**"
    
display(Markdown(txt))

Number of Rows: 143424
Number of Patients: 71518
Number of Encounter IDs: 101766

Line or Encounter: Dataset could be at the Line Level

Answer: The dataset is at the line level, because there are more rows (143,424) than unique encounter IDs (101,766).

Analyze Dataset

In [6]:
#
# According to the documentation, I've split the columns into ID fields
# and two feature sets (categorical and numerical)
# URL: https://www.hindawi.com/journals/bmri/2014/781670/tab1/
#

#
# patient and encounter id
# (statistically not interesting)
ids = ['encounter_id','patient_nbr']

#
# categorical (nominal) features
#
features_categorical = ['race', 
                        'gender',
                        'age', 
                        'weight',                    # documentation says numeric, but here it is categorical
                        'admission_type_id',
                        'discharge_disposition_id',
                        'admission_source_id', 
                        'payer_code',
                        'medical_specialty',
                        'primary_diagnosis_code',
                        'other_diagnosis_codes',
                        'ndc_code',
                        'max_glu_serum',
                        'A1Cresult',
                        'change',
                        'readmitted']

#
# numerical features
#
features_numerical = ['time_in_hospital',    # Integer number of days between admission and discharge
                      'number_outpatient',
                      'number_inpatient',
                      'number_emergency',
                      'num_lab_procedures',
                      'number_diagnoses',
                      'num_medications',
                      'num_procedures']

Question 2: Utilizing the library of your choice (recommend Pandas and Seaborn or matplotlib though), perform exploratory data analysis on the dataset. In particular be sure to address the following questions:

- a. Field(s) with a high amount of missing/zero values
- b. Based off the frequency histogram for each numerical field, which numerical field(s) has/have a Gaussian (normal) distribution shape?
- c. Which field(s) have high cardinality and why? (HINT: ndc_code is one feature)
- d. Please describe the demographic distributions in the dataset for the age and gender fields.

OPTIONAL: Use the Tensorflow Data Validation and Analysis library to complete.

  • The Tensorflow Data Validation and Analysis library (https://www.tensorflow.org/tfx/data_validation/get_started) is a useful tool for analyzing and summarizing dataset statistics. It is especially useful because it can scale to large datasets that do not fit into memory.
  • Note that there are some bugs that are still being resolved with Chrome v80 and we have moved away from using this for the project.
In [7]:
df_columns_info = [{'Feature': col, 
                    'Frequency': df.shape[0],
                    '# Freq_Null': df[col].isnull().sum(), 
                    '% Freq_Null': np.around(df[col].isnull().sum()/df.shape[0]*100, 1), 
                    '# Freq_Zeros': '-', 
                    '% Freq_Zeros': '-', 
                    'Type': 'Categorical'} for col in list(features_categorical)]

df_columns_info +=  [{'Feature': col, 
                      'Frequency': df.shape[0],
                      '# Freq_Null': df[col].isnull().sum(), 
                      '% Freq_Null': np.around(df[col].isnull().sum()/df.shape[0]*100, 1), 
                      '# Freq_Zeros': (df[col]==0).sum(), 
                      '% Freq_Zeros': np.around((df[col] == 0).sum()/df.shape[0]*100, 1), 
                      'Type': 'Numerical'} for col in list(features_numerical)]

# show missings and zeros
txt = '## Missings and zeros '
display(Markdown(txt))
pd.DataFrame(df_columns_info)

Missings and zeros

Out[7]:
Feature Frequency # Freq_Null % Freq_Null # Freq_Zeros % Freq_Zeros Type
0 race 143424 3309 2.3 - - Categorical
1 gender 143424 0 0.0 - - Categorical
2 age 143424 0 0.0 - - Categorical
3 weight 143424 139122 97.0 - - Categorical
4 admission_type_id 143424 0 0.0 - - Categorical
5 discharge_disposition_id 143424 0 0.0 - - Categorical
6 admission_source_id 143424 0 0.0 - - Categorical
7 payer_code 143424 54190 37.8 - - Categorical
8 medical_specialty 143424 69463 48.4 - - Categorical
9 primary_diagnosis_code 143424 33 0.0 - - Categorical
10 other_diagnosis_codes 143424 340 0.2 - - Categorical
11 ndc_code 143424 23462 16.4 - - Categorical
12 max_glu_serum 143424 136409 95.1 - - Categorical
13 A1Cresult 143424 117650 82.0 - - Categorical
14 change 143424 0 0.0 - - Categorical
15 readmitted 143424 0 0.0 - - Categorical
16 time_in_hospital 143424 0 0.0 0 0 Numerical
17 number_outpatient 143424 0 0.0 120027 83.7 Numerical
18 number_inpatient 143424 0 0.0 96698 67.4 Numerical
19 number_emergency 143424 0 0.0 127444 88.9 Numerical
20 num_lab_procedures 143424 0 0.0 0 0 Numerical
21 number_diagnoses 143424 0 0.0 0 0 Numerical
22 num_medications 143424 0 0.0 0 0 Numerical
23 num_procedures 143424 0 0.0 65788 45.9 Numerical
In [8]:
fig = make_subplots(rows=int(np.ceil(len(features_numerical)/2)), cols=2)

for e, feature in enumerate(features_numerical, start=2):
    # create a histogram trace for this feature
    trace = go.Histogram(x=df[feature], name=feature)

    # place it on the grid: row = e//2, col = e%2 + 1 (two plots per row)
    fig.append_trace(trace, e//2, e%2 + 1)

# update figure layout
fig.update_layout(  title="Histograms",
                    legend_title="Features",
                    font=dict(
                        size=16,
                    )
                )

display(Markdown("## Numerical Features: Histgrams"))
fig

Numerical Features: Histograms

Answer: Three numerical fields show a clear Gaussian (normal) distribution shape:

- time_in_hospital
- num_lab_procedures
- num_medications

Categorical Features: High Cardinality

In [9]:
for feature in features_categorical:
    
    # create
    vc = df[feature].fillna('None').value_counts().reset_index()
    vc.columns = ['Value', '# Amount']
    vc['% Perc.'] = np.around(vc['# Amount']/vc['# Amount'].sum()*100, 1)
    
    # print result 
    text = f'### Feature: {feature}  \n'
    text += f'<b>Nr of categories</b>: {vc.shape[0]}  \n'
    text += f'<b>Missing values</b>: {df[feature].isnull().sum() > 0}  \n'
    text += f'<b>High Cardinality</b>: {"Yes" if vc.shape[0] > 100 else "No"}'
    display(Markdown(text))
    display(vc)
    print('')

Feature: race

Nr of categories: 6
Missing values: True
High Cardinality: No

Value # Amount % Perc.
0 Caucasian 107688 75.1
1 AfricanAmerican 26427 18.4
2 None 3309 2.3
3 Hispanic 2938 2.0
4 Other 2174 1.5
5 Asian 888 0.6

Feature: gender

Nr of categories: 3
Missing values: False
High Cardinality: No

Value # Amount % Perc.
0 Female 76185 53.1
1 Male 67234 46.9
2 Unknown/Invalid 5 0.0

Feature: age

Nr of categories: 10
Missing values: False
High Cardinality: No

Value # Amount % Perc.
0 [70-80) 36928 25.7
1 [60-70) 32741 22.8
2 [50-60) 25095 17.5
3 [80-90) 23527 16.4
4 [40-50) 13729 9.6
5 [30-40) 4964 3.5
6 [90-100) 3619 2.5
7 [20-30) 1927 1.3
8 [10-20) 733 0.5
9 [0-10) 161 0.1

Feature: weight

Nr of categories: 10
Missing values: True
High Cardinality: No

Value # Amount % Perc.
0 None 139122 97.0
1 [75-100) 1817 1.3
2 [50-75) 1133 0.8
3 [100-125) 890 0.6
4 [125-150) 200 0.1
5 [25-50) 118 0.1
6 [0-25) 67 0.0
7 [150-175) 55 0.0
8 [175-200) 18 0.0
9 >200 4 0.0

Feature: admission_type_id

Nr of categories: 8
Missing values: False
High Cardinality: No

Value # Amount % Perc.
0 1 74713 52.1
1 3 27756 19.4
2 2 26823 18.7
3 6 7015 4.9
4 5 6584 4.6
5 8 488 0.3
6 7 33 0.0
7 4 12 0.0

Feature: discharge_disposition_id

Nr of categories: 26
Missing values: False
High Cardinality: No

Value # Amount % Perc.
0 1 85308 59.5
1 3 19677 13.7
2 6 18945 13.2
3 18 4658 3.2
4 22 3077 2.1
5 2 2906 2.0
6 11 1911 1.3
7 5 1631 1.1
8 25 1285 0.9
9 4 1090 0.8
10 7 782 0.5
11 23 602 0.4
12 13 521 0.4
13 14 432 0.3
14 28 200 0.1
15 8 147 0.1
16 15 93 0.1
17 24 65 0.0
18 9 29 0.0
19 17 20 0.0
20 16 19 0.0
21 19 8 0.0
22 10 6 0.0
23 27 5 0.0
24 20 4 0.0
25 12 3 0.0

Feature: admission_source_id

Nr of categories: 17
Missing values: False
High Cardinality: No

Value # Amount % Perc.
0 7 80443 56.1
1 1 42773 29.8
2 17 9338 6.5
3 4 4467 3.1
4 6 3108 2.2
5 2 1500 1.0
6 5 1048 0.7
7 20 247 0.2
8 3 247 0.2
9 9 185 0.1
10 8 27 0.0
11 22 21 0.0
12 10 10 0.0
13 25 4 0.0
14 11 3 0.0
15 14 2 0.0
16 13 1 0.0

Feature: payer_code

Nr of categories: 18
Missing values: True
High Cardinality: No

Value # Amount % Perc.
0 None 54190 37.8
1 MC 46532 32.4
2 HM 8784 6.1
3 SP 7613 5.3
4 BC 6991 4.9
5 MD 4983 3.5
6 CP 3687 2.6
7 UN 3665 2.6
8 CM 2971 2.1
9 OG 1532 1.1
10 PO 919 0.6
11 DM 757 0.5
12 WC 230 0.2
13 CH 208 0.1
14 OT 160 0.1
15 MP 122 0.1
16 SI 79 0.1
17 FR 1 0.0

Feature: medical_specialty

Nr of categories: 73
Missing values: True
High Cardinality: No

Value # Amount % Perc.
0 None 69463 48.4
1 InternalMedicine 20403 14.2
2 Emergency/Trauma 11595 8.1
3 Family/GeneralPractice 10508 7.3
4 Cardiology 7473 5.2
... ... ... ...
68 Perinatology 1 0.0
69 Pediatrics-InfectiousDiseases 1 0.0
70 Speech 1 0.0
71 Proctology 1 0.0
72 SportsMedicine 1 0.0

73 rows × 3 columns


Feature: primary_diagnosis_code

Nr of categories: 717
Missing values: True
High Cardinality: Yes

Value # Amount % Perc.
0 414 9473 6.6
1 428 9385 6.5
2 786 5432 3.8
3 486 5226 3.6
4 410 5076 3.5
... ... ... ...
712 939 1 0.0
713 318 1 0.0
714 391 1 0.0
715 V07 1 0.0
716 61 1 0.0

717 rows × 3 columns


Feature: other_diagnosis_codes

Nr of categories: 19374
Missing values: True
High Cardinality: Yes

Value # Amount % Perc.
0 250|401 3637 2.5
1 401|250 3060 2.1
2 276|276 968 0.7
3 414|250 922 0.6
4 428|427 911 0.6
... ... ... ...
19369 850|305 1 0.0
19370 759|459 1 0.0
19371 327|281 1 0.0
19372 808|518 1 0.0
19373 780|464 1 0.0

19374 rows × 3 columns


Feature: ndc_code

Nr of categories: 252
Missing values: True
High Cardinality: Yes

Value # Amount % Perc.
0 None 23462 16.4
1 68071-1701 20770 14.5
2 47918-902 20379 14.2
3 47918-898 6568 4.6
4 0173-0861 4060 2.8
... ... ... ...
247 43063-699 1 0.0
248 0228-2752 1 0.0
249 0781-5635 1 0.0
250 64764-302 1 0.0
251 33342-177 1 0.0

252 rows × 3 columns


Feature: max_glu_serum

Nr of categories: 4
Missing values: True
High Cardinality: No

Value # Amount % Perc.
0 None 136409 95.1
1 Norm 3220 2.2
2 >200 2043 1.4
3 >300 1752 1.2

Feature: A1Cresult

Nr of categories: 4
Missing values: True
High Cardinality: No

Value # Amount % Perc.
0 None 117650 82.0
1 >8 13110 9.1
2 Norm 6955 4.8
3 >7 5709 4.0

Feature: change

Nr of categories: 2
Missing values: False
High Cardinality: No

Value # Amount % Perc.
0 Ch 88669 61.8
1 No 54755 38.2

Feature: readmitted

Nr of categories: 3
Missing values: False
High Cardinality: No

Value # Amount % Perc.
0 NO 77248 53.9
1 >30 50434 35.2
2 <30 15742 11.0

Answer: From the above tables, with a cardinality threshold of 100, there are three high-cardinality features:

- other_diagnosis_codes
- primary_diagnosis_code
- ndc_code
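As a compact cross-check of the per-feature tables above, the cardinality scan can also be written in a few lines of pandas. The frame below is a toy stand-in for the notebook's df; in the real data primary_diagnosis_code (717), other_diagnosis_codes (19,374), and ndc_code (252) are the columns that cross the threshold of 100 categories.

```python
import pandas as pd

# Toy frame standing in for the notebook's df (real column names,
# made-up values); on the toy data no column exceeds the threshold.
toy = pd.DataFrame({
    "gender": ["Female", "Male", "Female"],
    "primary_diagnosis_code": ["414", "428", "786"],
})

# unique-value count per column, highest cardinality first
cardinality = toy.nunique().sort_values(ascending=False)

# columns above the same threshold used in the answer above
high_cardinality = cardinality[cardinality > 100].index.tolist()
```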

Demographic distributions in the dataset for the age and gender fields

In [10]:
text = '<b>Absolute:</b> Distribution of Age by gender.'

df_dist = pd.crosstab(df['gender'], df['age'])
df_dist['Total'] = df_dist.sum(axis=1) 
df_dist['% Total'] = np.around(df_dist['Total']/df_dist['Total'].sum()*100, 1)
df_dist = df_dist.T
df_dist['Total'] = df_dist.sum(axis=1) 
df_dist['% Total'] =  np.around(df_dist['Total']/df_dist.loc['Total', 'Total']*100, 1)
df_dist = df_dist.T
df_dist.loc['% Total', '% Total'] = 100
df_dist.iloc[:-1,:-1] = df_dist.iloc[:-1,:-1].astype(np.int32).astype(str)

print('')
display(Markdown(text))
display(df_dist)
df_dist.astype(np.float32).T.iloc[:-1,:-2][['Male','Female','Unknown/Invalid']].plot(kind='bar', barmode="group", title="Distribution: Age - Gender (Absolute)")

Absolute: Distribution of Age by gender.

age [0-10) [10-20) [20-30) [30-40) [40-50) [50-60) [60-70) [70-80) [80-90) [90-100) Total % Total
gender
Female 83 424 1271 2708 6771 12399 16011 19666 14268 2584 76185 53.1
Male 78 309 656 2256 6958 12696 16729 17258 9259 1035 67234 46.9
Unknown/Invalid 0 0 0 0 0 0 1 4 0 0 5 0.0
Total 161 733 1927 4964 13729 25095 32741 36928 23527 3619 143424 100.0
% Total 0.1 0.5 1.3 3.5 9.6 17.5 22.8 25.7 16.4 2.5 100 100.0
In [11]:
text = '<b>Relative %:</b> Show the distribution of Age by gender.  \n'
text += 'How would the distribution look if an equal number of men and women were represented?'

df_dist = pd.crosstab(df['gender'], df['age'])
df_dist = np.around(df_dist.div(df_dist.sum(axis=1), axis=0) * 100, 3)
df_dist['Total'] = df_dist.sum(axis=1) 


print('')
display(Markdown(text))
display(df_dist)
df_dist.T[['Male', 'Female']].plot(kind='bar', barmode="group", title="Distribution: Gender - Age (Relative)")

Relative %: Show the distribution of Age by gender.
How would the distribution look if an equal number of men and women were represented?

age [0-10) [10-20) [20-30) [30-40) [40-50) [50-60) [60-70) [70-80) [80-90) [90-100) Total
gender
Female 0.109 0.557 1.668 3.555 8.888 16.275 21.016 25.813 18.728 3.392 100.001
Male 0.116 0.460 0.976 3.355 10.349 18.883 24.882 25.669 13.771 1.539 100.000
Unknown/Invalid 0.000 0.000 0.000 0.000 0.000 0.000 20.000 80.000 0.000 0.000 100.000
In [12]:
text = f"**Answer**: From the above information & interactive plots :), we can observe the following things:  \n"
text += f"- There are more Female then Male patients in the dataset: 53.1% vs 46.9%  \n"
text += f"- The most common age group is: [70-80], they represent 25.7% of the data  \n"
text += f"- 92% of the data has a age value between [40 and 90[  \n"
text += f"- Via the relative table and figure, we can notice that up to the age group of 70, we note that men are more often represented than women  \n"
text += f"  (Probably because men die earlier?)  \n"


display(Markdown(text))

Answer: From the above information & interactive plots :), we can observe the following things:

  • There are more Female than Male patients in the dataset: 53.1% vs 46.9%
  • The most common age group is [70-80); it represents 25.7% of the data
  • 92% of the data has an age value in the range [40, 90)
  • Via the relative table and figure, we can see that up to the age group of 70, men are more often represented than women
    (possibly reflecting shorter male life expectancy)

Reduce Dimensionality of the NDC Code Feature

Question 3: NDC codes are a common format to represent the wide variety of drugs that are prescribed for patient care in the United States. The challenge is that there are many codes that map to the same or similar drug. You are provided with the ndc drug lookup file https://github.com/udacity/nd320-c1-emr-data-starter/blob/master/project/data_schema_references/ndc_lookup_table.csv derived from the National Drug Codes List site(https://ndclist.com/). Please use this file to come up with a way to reduce the dimensionality of this field and create a new field in the dataset called "generic_drug_name" in the output dataframe.

In [13]:
#NDC code lookup file
ndc_code_path = "./medication_lookup_tables/final_ndc_lookup_table"
ndc_code_df = pd.read_csv(ndc_code_path)
ndc_code_df.head(3)
Out[13]:
NDC_Code Proprietary Name Non-proprietary Name Dosage Form Route Name Company Name Product Type
0 0087-6060 Glucophage Metformin Hydrochloride Tablet, Film Coated Oral Bristol-myers Squibb Company Human Prescription Drug
1 0087-6063 Glucophage XR Metformin Hydrochloride Tablet, Extended Release Oral Bristol-myers Squibb Company Human Prescription Drug
2 0087-6064 Glucophage XR Metformin Hydrochloride Tablet, Extended Release Oral Bristol-myers Squibb Company Human Prescription Drug
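One possible shape for `reduce_dimension_ndc` (the graded implementation belongs in `student_utils.py`; this is only a sketch that assumes the lookup columns shown above): merge on the NDC code and take the `Non-proprietary Name` as the generic drug name.

```python
import pandas as pd

def reduce_dimension_ndc_sketch(df, ndc_df):
    """Map each ndc_code to its generic (non-proprietary) drug name.

    Sketch only: a left merge keeps rows whose ndc_code is missing or
    absent from the lookup table (their generic_drug_name becomes NaN).
    """
    lookup = ndc_df[["NDC_Code", "Non-proprietary Name"]].rename(
        columns={"NDC_Code": "ndc_code",
                 "Non-proprietary Name": "generic_drug_name"})
    return df.merge(lookup, on="ndc_code", how="left")

# toy data: two brand-level codes collapse to one generic name
toy = pd.DataFrame({"ndc_code": ["0087-6060", "0087-6063"]})
lookup_toy = pd.DataFrame({
    "NDC_Code": ["0087-6060", "0087-6063"],
    "Non-proprietary Name": ["Metformin Hydrochloride"] * 2,
})
reduced = reduce_dimension_ndc_sketch(toy, lookup_toy)
```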
In [14]:
from student_utils import reduce_dimension_ndc
In [15]:
reduce_dim_df = reduce_dimension_ndc(df, ndc_code_df)
In [16]:
# Number of unique values should be less for the new output field
assert df['ndc_code'].nunique() > reduce_dim_df['generic_drug_name'].nunique()
print(f"Tests passed!! {df['ndc_code'].nunique()} - {reduce_dim_df['generic_drug_name'].nunique()}")
Tests passed!! 251 - 36

Select First Encounter for each Patient

Question 4: In order to simplify the aggregation of data for the model, we will only select the first encounter for each patient in the dataset. This is to reduce the risk of data leakage of future patient encounters and to reduce complexity of the data transformation and modeling steps. We will assume that sorting in numerical order on the encounter_id provides the time horizon for determining which encounters come before and after another.
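A sketch of `select_first_encounter` under the question's stated assumption (the graded version belongs in `student_utils.py`): find each patient's smallest encounter_id, then keep all line-level rows belonging to that encounter.

```python
import pandas as pd

def select_first_encounter_sketch(df):
    """Keep every line belonging to each patient's earliest encounter.

    Sketch only: assumes, per the question, that a smaller encounter_id
    means an earlier encounter in time.
    """
    first_ids = df.groupby("patient_nbr")["encounter_id"].min()
    return df[df["encounter_id"].isin(first_ids)].reset_index(drop=True)

# toy line-level data: patient 1 has encounters 10 (two lines) and 20
lines = pd.DataFrame({
    "patient_nbr":  [1, 1, 1, 2],
    "encounter_id": [10, 10, 20, 5],
})
first = select_first_encounter_sketch(lines)
```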

In [17]:
from student_utils import select_first_encounter
first_encounter_df = select_first_encounter(reduce_dim_df)
In [18]:
# unique patients in transformed dataset
unique_patients = first_encounter_df['patient_nbr'].nunique()
print("Number of unique patients:{}".format(unique_patients))

# unique encounters in transformed dataset
unique_encounters = first_encounter_df['encounter_id'].nunique()
print("Number of unique encounters:{}".format(unique_encounters))

original_unique_patient_number = reduce_dim_df['patient_nbr'].nunique()
# number of unique patients should be equal to the number of unique encounters and patients in the final dataset
assert original_unique_patient_number == unique_patients
assert original_unique_patient_number == unique_encounters
print("Tests passed!!")
Number of unique patients:71518
Number of unique encounters:71518
Tests passed!!

Aggregate Dataset to Right Level for Modeling

In order to provide a broad scope of the steps and to prevent students from getting stuck with data transformations, we have selected the aggregation columns and provided a function to build the dataset at the appropriate level. The 'aggregate_dataset' function that you can find in the 'utils.py' file can take the preceding dataframe with the 'generic_drug_name' field and transform the data appropriately for the project.

To make it simpler for students, we are creating dummy columns for each unique generic drug name and adding those as input features to the model. There are other options for data representation, but these are out of scope given the time constraints of the course.
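`aggregate_dataset` is already provided in `utils.py`; the toy sketch below only illustrates the dummy-column idea it applies (the drug names and miniature frame are made up, not the real implementation):

```python
import pandas as pd

# Miniature of the dummy-column aggregation: pivot generic_drug_name
# into one 0/1 indicator column per drug, then collapse the line-level
# rows to one row per encounter.
lines = pd.DataFrame({
    "encounter_id":      [1, 1, 2],
    "generic_drug_name": ["Metformin", "Glipizide", "Metformin"],
})
dummies = pd.get_dummies(lines["generic_drug_name"]).astype(int)
agg = (pd.concat([lines[["encounter_id"]], dummies], axis=1)
         .groupby("encounter_id").max().reset_index())
drug_col_list = list(dummies.columns)  # analogous to ndc_col_list
```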

In [19]:
exclusion_list = ['generic_drug_name']
grouping_field_list = [c for c in first_encounter_df.columns if c not in exclusion_list]
In [20]:
# !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
# !!! FILL THE NA's with MISSING, or this method doesn't work !!!
# !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
agg_drug_df, ndc_col_list = aggregate_dataset(first_encounter_df.fillna('Missing'), grouping_field_list, 'generic_drug_name')
/home/workspace/starter_code/utils.py:10: FutureWarning:

Indexing with multiple keys (implicitly converted to a tuple of keys) will be deprecated, use a list instead.

In [21]:
assert len(agg_drug_df) == agg_drug_df['patient_nbr'].nunique() == agg_drug_df['encounter_id'].nunique()
In [22]:
ndc_col_list
Out[22]:
['Acarbose',
 'Afrezza',
 'Amaryl',
 'Avandia_2MG',
 'Avandia_4MG',
 'Glimepiride',
 'Glipizide',
 'Glipizide_And_Metformin_Hcl',
 'Glipizide_And_Metformin_Hydrochloride',
 'Glucophage',
 'Glucophage_XR',
 'Glucotrol',
 'Glucotrol_XL',
 'Glyburide',
 'Glyburide_And_Metformin_Hydrochloride',
 'Glyburide-metformin_Hydrochloride',
 'Glynase',
 'Glyset',
 'Humulin_R',
 'Metformin_Hcl',
 'Metformin_Hydrochloride',
 'Metformin_Hydrochloride_Extended_Release',
 'Miglitol',
 'Missing',
 'Nateglinide',
 'Novolin_R',
 'Pioglitazone',
 'Pioglitazone_Hydrochloride_And_Glimepiride',
 'Prandin',
 'Repaglinide',
 'Riomet',
 'Riomet_Er',
 'Starlix',
 'Tolazamide',
 'Tolbutamide']

Prepare Fields and Cast Dataset

Feature Selection

Question 5: After you have aggregated the dataset to the right level, we can do feature selection (we will include the ndc_col_list dummy-column features too). In the block below, please select the categorical and numerical features that you will use for the model, so that we can create a dataset subset.

For the payer_code and weight fields, please provide whether you think we should include/exclude the field in our model and give a justification/rationale for this based off of the statistics of the data. Feel free to use visualizations or summary statistics to support your choice.

Answer: We won't include payer_code or weight, because they contain too many missing values (37.8% and 97.0% missing, respectively).
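The exclusion rationale can be made quantitative with a one-liner (toy data below; on the real dataset the same computation gives 37.8% for payer_code and 97.0% for weight, matching the missingness table earlier):

```python
import pandas as pd

# Toy frame with made-up values; only the column names mirror the dataset.
toy = pd.DataFrame({
    "payer_code": ["MC", None, None, "HM"],
    "weight":     [None, None, None, "[75-100)"],
})

# percentage of missing values per candidate column
missing_pct = toy.isnull().mean() * 100
```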

In [23]:
features_categorical
Out[23]:
['race',
 'gender',
 'age',
 'weight',
 'admission_type_id',
 'discharge_disposition_id',
 'admission_source_id',
 'payer_code',
 'medical_specialty',
 'primary_diagnosis_code',
 'other_diagnosis_codes',
 'ndc_code',
 'max_glu_serum',
 'A1Cresult',
 'change',
 'readmitted']
In [24]:
features_numerical
Out[24]:
['time_in_hospital',
 'number_outpatient',
 'number_inpatient',
 'number_emergency',
 'num_lab_procedures',
 'number_diagnoses',
 'num_medications',
 'num_procedures']
In [25]:
'''
Please update the list to include the features you think are appropriate for the model 
and the field that we will be using to train the model. There are three required demographic features for the model 
and I have inserted a list with them already in the categorical list. 
These will be required for later steps when analyzing data splits and model biases.
'''

required_demo_col_list = ['race', 'gender', 'age']
student_categorical_col_list = ['primary_diagnosis_code'] + required_demo_col_list + ndc_col_list
student_numerical_col_list = ['num_lab_procedures', 'number_diagnoses', 'num_medications', 'num_procedures']
PREDICTOR_FIELD = 'time_in_hospital'
In [26]:
def select_model_features(df, categorical_col_list, numerical_col_list, PREDICTOR_FIELD, grouping_key='patient_nbr'):
    selected_col_list = [grouping_key] + [PREDICTOR_FIELD] + categorical_col_list + numerical_col_list
    return df[selected_col_list]  # use the df argument rather than the global agg_drug_df
In [27]:
selected_features_df = select_model_features(agg_drug_df, student_categorical_col_list, 
                                             student_numerical_col_list,
                                             PREDICTOR_FIELD)

Preprocess Dataset - Casting and Imputing

We will cast and impute the dataset before splitting so that we do not have to repeat these steps across the splits in the next step. For imputing, there can be deeper analysis into which features to impute and how to impute but for the sake of time, we are taking a general strategy of imputing zero for only numerical features.

OPTIONAL: What are some potential issues with this approach? Can you recommend a better way and also implement it?

Answer: Filling missing values with zeros is ambiguous. For a feature like "number of previous visits", a zero can mean either 1) the patient truly had no previous visits, or 2) we simply don't know the value. A better strategy depends on the model: for a tree-based model such as a random forest, I would replace missing values with a sentinel like -1, so the model can learn to treat them as a distinct "missing" signal. For a model such as a neural network, I would more likely replace the value with the median by age group and gender (provided the number of missing values for that particular feature is limited).
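As a sketch of the group-median strategy described above: the function and column names below are illustrative (not part of the starter code), and assume the demographic grouping columns exist in the dataframe.

```python
import pandas as pd

def impute_numeric_by_group(df, col, group_cols, fallback=0):
    """Impute missing values in `col` with the median of that row's demographic group,
    falling back to the overall median (then `fallback`) so no NaNs survive."""
    group_median = df.groupby(group_cols)[col].transform('median')
    overall_median = df[col].median()
    return df[col].fillna(group_median).fillna(overall_median).fillna(fallback)

# Hypothetical toy frame; on the real data the group columns could be ['age', 'gender']
toy = pd.DataFrame({
    'age': ['[60-70)', '[60-70)', '[60-70)', '[70-80)'],
    'gender': ['Female', 'Female', 'Female', 'Male'],
    'num_lab_procedures': [40.0, None, 50.0, None],
})
imputed = impute_numeric_by_group(toy, 'num_lab_procedures', ['age', 'gender'])
```

The overall-median fallback matters: a group whose values are all missing (like the last row above) would otherwise stay NaN.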

In [28]:
processed_df = preprocess_df(selected_features_df, 
                             student_categorical_col_list, 
                             student_numerical_col_list,
                             PREDICTOR_FIELD,
                             categorical_impute_value='nan',
                             numerical_impute_value=0)
/home/workspace/starter_code/utils.py:29: SettingWithCopyWarning:


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy

/home/workspace/starter_code/utils.py:31: SettingWithCopyWarning (as above)
/home/workspace/starter_code/utils.py:33: SettingWithCopyWarning (as above)

Split Dataset into Train, Validation, and Test Partitions

Question 6: In order to prepare the data for training and evaluation by a deep learning model, we will split the dataset into three partitions, with the validation partition used for optimizing the model hyperparameters during training. One key requirement is to make sure that the data does not accidentally leak across partitions.

Please complete the function below to split the input dataset into three partitions (train, validation, test) with the following requirements.

  • Approximately 60%/20%/20% train/validation/test split
  • Randomly sample different patients into each data partition
  • IMPORTANT: Make sure that a patient's data is not in more than one partition, so that we can avoid possible data leakage.
  • Make sure that the total number of unique patients across the splits is equal to the total number of unique patients in the original dataset
  • Total number of rows in original dataset = sum of rows across all three dataset partitions
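The requirements above can be sketched as follows. This is one possible implementation, not the actual `patient_dataset_splitter` in student_utils.py: shuffle the unique patient IDs, then assign whole patients (all their rows) to a partition.

```python
import numpy as np
import pandas as pd

def patient_dataset_splitter_sketch(df, patient_key='patient_nbr', seed=42):
    """Split at the patient level (~60/20/20) so no patient leaks across partitions."""
    rng = np.random.RandomState(seed)
    shuffled = rng.permutation(df[patient_key].unique())
    n = len(shuffled)
    train_ids = shuffled[: int(0.6 * n)]
    val_ids = shuffled[int(0.6 * n): int(0.8 * n)]
    test_ids = shuffled[int(0.8 * n):]
    d_train = df[df[patient_key].isin(train_ids)].reset_index(drop=True)
    d_val = df[df[patient_key].isin(val_ids)].reset_index(drop=True)
    d_test = df[df[patient_key].isin(test_ids)].reset_index(drop=True)
    return d_train, d_val, d_test

# Toy check: 10 patients, one row each, plus a second encounter for patient 0
toy = pd.DataFrame({'patient_nbr': list(range(10)) + [0], 'x': range(11)})
tr, va, te = patient_dataset_splitter_sketch(toy)
```

Because the shuffle happens over patient IDs rather than rows, both encounters for patient 0 land in the same partition, and the row counts across partitions still sum to the original total.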
In [29]:
from student_utils import patient_dataset_splitter
d_train, d_val, d_test = patient_dataset_splitter(processed_df, 'patient_nbr')
In [30]:
assert len(d_train) + len(d_val) + len(d_test) == len(processed_df)
print("Test passed for number of total rows equal!")
Test passed for number of total rows equal!
In [31]:
assert (d_train['patient_nbr'].nunique() + d_val['patient_nbr'].nunique() + d_test['patient_nbr'].nunique()) == agg_drug_df['patient_nbr'].nunique()
print("Test passed for number of unique patients being equal!")
Test passed for number of unique patients being equal!

Demographic Representation Analysis of Split

After the split, we should check the distribution of key features/groups and make sure that there are representative samples across the partitions. The show_group_stats_viz function in the utils.py file can be used to group and visualize different groups and dataframe partitions.

Label Distribution Across Partitions

Below you can see the distribution of the label across your splits. Are the histogram distribution shapes similar across partitions?

In [32]:
pd.options.plotting.backend = "matplotlib"

show_group_stats_viz(processed_df, PREDICTOR_FIELD)
time_in_hospital
1.0     10717
2.0     12397
3.0     12701
4.0      9567
5.0      6839
6.0      5171
7.0      3999
8.0      2919
9.0      1990
10.0     1558
11.0     1241
12.0      955
13.0      795
14.0      669
dtype: int64
AxesSubplot(0.125,0.125;0.775x0.755)
In [33]:
show_group_stats_viz(d_train, PREDICTOR_FIELD)
time_in_hospital
1.0     6437
2.0     7344
3.0     7629
4.0     5737
5.0     4099
6.0     3061
7.0     2472
8.0     1763
9.0     1205
10.0     957
11.0     757
12.0     568
13.0     493
14.0     389
dtype: int64
AxesSubplot(0.125,0.125;0.775x0.755)
In [34]:
show_group_stats_viz(d_test, PREDICTOR_FIELD)
time_in_hospital
1.0     2165
2.0     2490
3.0     2546
4.0     1923
5.0     1359
6.0     1058
7.0      786
8.0      585
9.0      390
10.0     300
11.0     231
12.0     181
13.0     148
14.0     141
dtype: int64
AxesSubplot(0.125,0.125;0.775x0.755)

Demographic Group Analysis

We should check that our partitions/splits of the dataset are similar in terms of their demographic profiles. Below you can see how we might visualize and analyze the full dataset vs. the partitions.

In [35]:
# Full dataset before splitting
patient_demo_features = ['race', 'gender', 'age', 'patient_nbr']
patient_group_analysis_df = processed_df[patient_demo_features].groupby('patient_nbr').head(1).reset_index(drop=True)
show_group_stats_viz(patient_group_analysis_df, 'gender')
gender
Female             38025
Male               33490
Unknown/Invalid        3
dtype: int64
AxesSubplot(0.125,0.125;0.775x0.755)
In [36]:
# Training partition
show_group_stats_viz(d_train, 'gender')
gender
Female             22669
Male               20240
Unknown/Invalid        2
dtype: int64
AxesSubplot(0.125,0.125;0.775x0.755)
In [37]:
# Test partition
show_group_stats_viz(d_test, 'gender')
gender
Female    7626
Male      6677
dtype: int64
AxesSubplot(0.125,0.125;0.775x0.755)

Convert Dataset Splits to TF Dataset

We have provided the function to convert the Pandas dataframe to TF tensors using the TF Dataset API. Please note that this is not a scalable method; for larger datasets, the 'make_csv_dataset' method is recommended: https://www.tensorflow.org/api_docs/python/tf/data/experimental/make_csv_dataset.

In [38]:
# Convert dataset from Pandas dataframes to TF dataset 
batch_size = 128
diabetes_train_ds = df_to_dataset(d_train, PREDICTOR_FIELD, batch_size=batch_size)
diabetes_val_ds = df_to_dataset(d_val, PREDICTOR_FIELD, batch_size=batch_size)
diabetes_test_ds = df_to_dataset(d_test, PREDICTOR_FIELD, batch_size=batch_size)
In [39]:
# We use this sample of the dataset to show transformations later
diabetes_batch = next(iter(diabetes_train_ds))[0]
def demo(feature_column, example_batch):
    feature_layer = layers.DenseFeatures(feature_column)
    print(feature_layer(example_batch))

4. Create Categorical Features with TF Feature Columns

Build Vocabulary for Categorical Features

Before we can create the TF categorical features, we must first create the vocab files containing the unique values for a given field, derived from the training dataset only. Below we have provided a function that requires just the pandas train dataset partition and the list of categorical columns. The output variable 'vocab_file_list' will be a list of the file paths that can be used in the next step for creating the categorical features.

In [40]:
#
# Input: d_train - The training data  (DataFrame)
#        student_categorical_col_list - Categorical features of our dataset
#
# Goal: for each feature (column) in student_categorical_col_list, write a txt file to a folder.
#       Each txt file contains the unique values from the training set and starts with a default value.
#


#
# Code to write a single txt file 
#
# Input: vocab_list - a list of values which belongs to a feature (column)
#        field_name - Name of the feature
#        default_value - the vocabulary is built from the training set only, so a value may first
#                        appear in the validation set or at prediction time; such out-of-vocabulary
#                        values map to this default value
#        vocab_dir - Location of the file
#
#
# def write_vocabulary_file(vocab_list, field_name, default_value, vocab_dir='./diabetes_vocab/'):
#   
#    output_file_path = os.path.join(vocab_dir, str(field_name) + "_vocab.txt")         # 1) define Location of the file 
#
#    # put default value in first row as TF requires                                    # 2) insert the default value(s)
#    vocab_list = np.insert(vocab_list, 0, default_value, axis=0) 
#
#    df = pd.DataFrame(vocab_list).to_csv(output_file_path, index=None, header=None)    # 3) write the file, via pandas
#    return output_file_path


#
# Code to create the txt files,
#
# def build_vocab_files(df, categorical_column_list, default_value='00'):
#    vocab_files_list = []
#    for c in categorical_column_list:                                               # 1) Loop over the categorical features
#        v_file = write_vocabulary_file(df[c].unique(), c, default_value)            # 2) get the unique values + default
#        vocab_files_list.append(v_file)                                             # 3) save the path
#    return vocab_files_list                                                         # 4) return all the paths
#

vocab_file_list = build_vocab_files(d_train, student_categorical_col_list)
vocab_file_list[:10]
Out[40]:
['./diabetes_vocab/primary_diagnosis_code_vocab.txt',
 './diabetes_vocab/race_vocab.txt',
 './diabetes_vocab/gender_vocab.txt',
 './diabetes_vocab/age_vocab.txt',
 './diabetes_vocab/Acarbose_vocab.txt',
 './diabetes_vocab/Afrezza_vocab.txt',
 './diabetes_vocab/Amaryl_vocab.txt',
 './diabetes_vocab/Avandia_2MG_vocab.txt',
 './diabetes_vocab/Avandia_4MG_vocab.txt',
 './diabetes_vocab/Glimepiride_vocab.txt']

Create Categorical Features with Tensorflow Feature Column API

Question 7: Using the vocab file list from above, derived from the features you selected earlier, please create categorical features with the Tensorflow Feature Column API, https://www.tensorflow.org/api_docs/python/tf/feature_column. Below is a function to help guide you.

In [41]:
# def create_tf_categorical_feature_cols(categorical_col_list,
#                               vocab_dir='./diabetes_vocab/'):
#     '''
#     categorical_col_list: list, categorical field list that will be transformed with TF feature column
#     vocab_dir: string, the path where the vocabulary text files are located
#     return:
#         output_tf_list: list of TF feature columns
#     '''
    
#     # key
#     ## A unique string identifying the input feature.
#     ## It is used as the column name and the dictionary key for feature parsing configs,
#     ## feature Tensor objects, and feature columns.
    
#     # vocabulary_file
#     ## The vocabulary file name.
    
    
#     # default_value
#     ## The integer ID value to return for out-of-vocabulary feature values,
#     ## defaults to -1. This can not be specified with a positive num_oov_buckets.
    
#     # num_oov_buckets
#     ## Non-negative integer, the number of out-of-vocabulary buckets. 
#     ## All out-of-vocabulary inputs will be assigned IDs in the range 
#     ## [vocabulary_size, vocabulary_size+num_oov_buckets) based on a hash of the input value.
#     ## A positive num_oov_buckets can not be specified with default_value.
    
#     output_tf_list = []
#     for c in categorical_col_list:
#         vocab_file_path = os.path.join(vocab_dir,  c + "_vocab.txt")
#         '''
#         Which TF function allows you to read from a text file and create a categorical feature
#         You can use a pattern like this below...
#         tf_categorical_feature_column = tf.feature_column.......

#         '''
#         tf_categorical_feature_column = tf.feature_column\
#                                           .categorical_column_with_vocabulary_file(key=c, 
#                                                                                    vocabulary_file = vocab_file_path, 
#                                                                                    default_value=0)
        
#         output_tf_list.append( tf.feature_column.indicator_column(tf_categorical_feature_column) )
#     return output_tf_list


########
# Info #
########
# Previously: we wrote txt files containing all the unique values of each categorical feature.
# 
# Goal: convert a feature (column) into a one-hot encoded matrix.
#
# How?: tf.feature_column.categorical_column_with_vocabulary_file reads a txt file,
#       which is essentially just a list of words/values, and builds a lookup table.
#
#        layers.DenseFeatures(tf.feature_column.indicator_column( ... ))({'primary_diagnosis_code': ['410','650','2020021']})
#
#        --> this will return: 
#                                array([[0., 0., 0., ..., 0., 0., 0.],
#                                       [0., 0., 0., ..., 0., 0., 0.],
#                                       [1., 0., 0., ..., 0., 0., 0.]], dtype=float32)
#
#         It found 410 and 650 in the txt file but not 2020021 (which maps to the default value),
#         and returned the corresponding one-hot encoded vectors.
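To see the same lookup behavior without running TF, here is a pure-numpy illustration (the function name is ours, not TF's) of vocabulary-based one-hot encoding, with out-of-vocabulary values mapped to the first row, mirroring default_value=0 above:

```python
import numpy as np

def one_hot_from_vocab(values, vocab, oov_index=0):
    """One-hot encode `values` against `vocab`; out-of-vocabulary values map to `oov_index`."""
    index = {v: i for i, v in enumerate(vocab)}
    encoded = np.zeros((len(values), len(vocab)), dtype=np.float32)
    for row, value in enumerate(values):
        encoded[row, index.get(value, oov_index)] = 1.0
    return encoded

# '00' plays the role of the default token written in the first row of each vocab file
encoded = one_hot_from_vocab(['410', '650', '2020021'], ['00', '410', '650'])
```

Here '410' and '650' hit their vocab rows, while the unseen '2020021' falls back to the default (first) position, just like in the TF output above.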
In [42]:
from student_utils import create_tf_categorical_feature_cols
tf_cat_col_list = create_tf_categorical_feature_cols(student_categorical_col_list)
INFO:tensorflow:vocabulary_size = 647 in primary_diagnosis_code is inferred from the number of elements in the vocabulary_file ./diabetes_vocab/primary_diagnosis_code_vocab.txt.
INFO:tensorflow:vocabulary_size = 7 in race is inferred from the number of elements in the vocabulary_file ./diabetes_vocab/race_vocab.txt.
INFO:tensorflow:vocabulary_size = 4 in gender is inferred from the number of elements in the vocabulary_file ./diabetes_vocab/gender_vocab.txt.
INFO:tensorflow:vocabulary_size = 11 in age is inferred from the number of elements in the vocabulary_file ./diabetes_vocab/age_vocab.txt.
INFO:tensorflow:vocabulary_size = 3 in Acarbose is inferred from the number of elements in the vocabulary_file ./diabetes_vocab/Acarbose_vocab.txt.
INFO:tensorflow:vocabulary_size = 3 in Afrezza is inferred from the number of elements in the vocabulary_file ./diabetes_vocab/Afrezza_vocab.txt.
INFO:tensorflow:vocabulary_size = 3 in Amaryl is inferred from the number of elements in the vocabulary_file ./diabetes_vocab/Amaryl_vocab.txt.
INFO:tensorflow:vocabulary_size = 3 in Avandia_2MG is inferred from the number of elements in the vocabulary_file ./diabetes_vocab/Avandia_2MG_vocab.txt.
INFO:tensorflow:vocabulary_size = 3 in Avandia_4MG is inferred from the number of elements in the vocabulary_file ./diabetes_vocab/Avandia_4MG_vocab.txt.
INFO:tensorflow:vocabulary_size = 3 in Glimepiride is inferred from the number of elements in the vocabulary_file ./diabetes_vocab/Glimepiride_vocab.txt.
INFO:tensorflow:vocabulary_size = 3 in Glipizide is inferred from the number of elements in the vocabulary_file ./diabetes_vocab/Glipizide_vocab.txt.
INFO:tensorflow:vocabulary_size = 3 in Glipizide_And_Metformin_Hcl is inferred from the number of elements in the vocabulary_file ./diabetes_vocab/Glipizide_And_Metformin_Hcl_vocab.txt.
INFO:tensorflow:vocabulary_size = 3 in Glipizide_And_Metformin_Hydrochloride is inferred from the number of elements in the vocabulary_file ./diabetes_vocab/Glipizide_And_Metformin_Hydrochloride_vocab.txt.
INFO:tensorflow:vocabulary_size = 3 in Glucophage is inferred from the number of elements in the vocabulary_file ./diabetes_vocab/Glucophage_vocab.txt.
INFO:tensorflow:vocabulary_size = 3 in Glucophage_XR is inferred from the number of elements in the vocabulary_file ./diabetes_vocab/Glucophage_XR_vocab.txt.
INFO:tensorflow:vocabulary_size = 3 in Glucotrol is inferred from the number of elements in the vocabulary_file ./diabetes_vocab/Glucotrol_vocab.txt.
INFO:tensorflow:vocabulary_size = 3 in Glucotrol_XL is inferred from the number of elements in the vocabulary_file ./diabetes_vocab/Glucotrol_XL_vocab.txt.
INFO:tensorflow:vocabulary_size = 3 in Glyburide is inferred from the number of elements in the vocabulary_file ./diabetes_vocab/Glyburide_vocab.txt.
INFO:tensorflow:vocabulary_size = 3 in Glyburide_And_Metformin_Hydrochloride is inferred from the number of elements in the vocabulary_file ./diabetes_vocab/Glyburide_And_Metformin_Hydrochloride_vocab.txt.
INFO:tensorflow:vocabulary_size = 3 in Glyburide-metformin_Hydrochloride is inferred from the number of elements in the vocabulary_file ./diabetes_vocab/Glyburide-metformin_Hydrochloride_vocab.txt.
INFO:tensorflow:vocabulary_size = 3 in Glynase is inferred from the number of elements in the vocabulary_file ./diabetes_vocab/Glynase_vocab.txt.
INFO:tensorflow:vocabulary_size = 3 in Glyset is inferred from the number of elements in the vocabulary_file ./diabetes_vocab/Glyset_vocab.txt.
INFO:tensorflow:vocabulary_size = 3 in Humulin_R is inferred from the number of elements in the vocabulary_file ./diabetes_vocab/Humulin_R_vocab.txt.
INFO:tensorflow:vocabulary_size = 3 in Metformin_Hcl is inferred from the number of elements in the vocabulary_file ./diabetes_vocab/Metformin_Hcl_vocab.txt.
INFO:tensorflow:vocabulary_size = 3 in Metformin_Hydrochloride is inferred from the number of elements in the vocabulary_file ./diabetes_vocab/Metformin_Hydrochloride_vocab.txt.
INFO:tensorflow:vocabulary_size = 3 in Metformin_Hydrochloride_Extended_Release is inferred from the number of elements in the vocabulary_file ./diabetes_vocab/Metformin_Hydrochloride_Extended_Release_vocab.txt.
INFO:tensorflow:vocabulary_size = 3 in Miglitol is inferred from the number of elements in the vocabulary_file ./diabetes_vocab/Miglitol_vocab.txt.
INFO:tensorflow:vocabulary_size = 3 in Missing is inferred from the number of elements in the vocabulary_file ./diabetes_vocab/Missing_vocab.txt.
INFO:tensorflow:vocabulary_size = 3 in Nateglinide is inferred from the number of elements in the vocabulary_file ./diabetes_vocab/Nateglinide_vocab.txt.
INFO:tensorflow:vocabulary_size = 3 in Novolin_R is inferred from the number of elements in the vocabulary_file ./diabetes_vocab/Novolin_R_vocab.txt.
INFO:tensorflow:vocabulary_size = 3 in Pioglitazone is inferred from the number of elements in the vocabulary_file ./diabetes_vocab/Pioglitazone_vocab.txt.
INFO:tensorflow:vocabulary_size = 3 in Pioglitazone_Hydrochloride_And_Glimepiride is inferred from the number of elements in the vocabulary_file ./diabetes_vocab/Pioglitazone_Hydrochloride_And_Glimepiride_vocab.txt.
INFO:tensorflow:vocabulary_size = 3 in Prandin is inferred from the number of elements in the vocabulary_file ./diabetes_vocab/Prandin_vocab.txt.
INFO:tensorflow:vocabulary_size = 3 in Repaglinide is inferred from the number of elements in the vocabulary_file ./diabetes_vocab/Repaglinide_vocab.txt.
INFO:tensorflow:vocabulary_size = 3 in Riomet is inferred from the number of elements in the vocabulary_file ./diabetes_vocab/Riomet_vocab.txt.
INFO:tensorflow:vocabulary_size = 3 in Riomet_Er is inferred from the number of elements in the vocabulary_file ./diabetes_vocab/Riomet_Er_vocab.txt.
INFO:tensorflow:vocabulary_size = 3 in Starlix is inferred from the number of elements in the vocabulary_file ./diabetes_vocab/Starlix_vocab.txt.
INFO:tensorflow:vocabulary_size = 3 in Tolazamide is inferred from the number of elements in the vocabulary_file ./diabetes_vocab/Tolazamide_vocab.txt.
INFO:tensorflow:vocabulary_size = 3 in Tolbutamide is inferred from the number of elements in the vocabulary_file ./diabetes_vocab/Tolbutamide_vocab.txt.
In [43]:
test_cat_var1 = tf_cat_col_list[0]
print("Example categorical field:\n{}".format(test_cat_var1))
demo(test_cat_var1, diabetes_batch)
Example categorical field:
IndicatorColumn(categorical_column=VocabularyFileCategoricalColumn(key='primary_diagnosis_code', vocabulary_file='./diabetes_vocab/primary_diagnosis_code_vocab.txt', vocabulary_size=647, num_oov_buckets=0, dtype=tf.string, default_value=0))
WARNING:tensorflow:From /opt/conda/lib/python3.7/site-packages/tensorflow_core/python/feature_column/feature_column_v2.py:4267: IndicatorColumn._variable_shape (from tensorflow.python.feature_column.feature_column_v2) is deprecated and will be removed in a future version.
Instructions for updating:
The old _FeatureColumn APIs are being deprecated. Please use the new FeatureColumn APIs instead.
WARNING:tensorflow:From /opt/conda/lib/python3.7/site-packages/tensorflow_core/python/feature_column/feature_column_v2.py:4322: VocabularyFileCategoricalColumn._num_buckets (from tensorflow.python.feature_column.feature_column_v2) is deprecated and will be removed in a future version.
Instructions for updating:
The old _FeatureColumn APIs are being deprecated. Please use the new FeatureColumn APIs instead.
tf.Tensor(
[[0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 1. ... 0. 0. 0.]
 ...
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]], shape=(128, 647), dtype=float32)

5. Create Numerical Features with TF Feature Columns

Question 8: Using the TF Feature Column API (https://www.tensorflow.org/api_docs/python/tf/feature_column/), please create normalized Tensorflow numeric features for the model. The z-score normalizer function below and the 'calculate_stats_from_train_data' function may help.

In [44]:
from student_utils import create_tf_numeric_feature

For simplicity, the create_tf_numerical_feature_cols function below uses the same normalizer function (z-score normalization) across all features, but if you have time, feel free to analyze the statistical distributions and adapt the normalizer accordingly. You may find this a good resource for determining which transformation best fits the data: https://developers.google.com/machine-learning/data-prep/transform/normalization.
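A minimal sketch of the z-score piece, assuming a normalizer with the `normalize_numeric_with_zscore(col, mean, std)` signature (the same one visible in the NumericColumn repr further down). The training-set statistics are bound with functools.partial so TF can later call the normalizer with just the tensor:

```python
import functools

def normalize_numeric_with_zscore(col, mean, std):
    """Standardize a value or tensor: subtract the training-set mean, divide by its std."""
    return (col - mean) / std

# Bind illustrative training-set statistics; a TF numeric column could then be built roughly as
#   tf.feature_column.numeric_column(key=col_name, normalizer_fn=normalizer, dtype=tf.float64)
normalizer = functools.partial(normalize_numeric_with_zscore, mean=43.0, std=20.0)
```

Using the train-partition mean/std for all partitions is deliberate: it keeps test-set statistics from leaking into the features.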

In [45]:
def calculate_stats_from_train_data(df, col):
    mean = df[col].describe()['mean']
    std = df[col].describe()['std']
    return mean, std

def create_tf_numerical_feature_cols(numerical_col_list, train_df):
    tf_numeric_col_list = []
    for c in numerical_col_list:
        mean, std = calculate_stats_from_train_data(train_df, c)
        tf_numeric_feature = create_tf_numeric_feature(c, mean, std)
        tf_numeric_col_list.append(tf_numeric_feature)
    return tf_numeric_col_list
In [46]:
tf_cont_col_list = create_tf_numerical_feature_cols(student_numerical_col_list, d_train)
In [47]:
test_cont_var1 = tf_cont_col_list[0]
print("Example continuous field:\n{}\n".format(test_cont_var1))
demo(test_cont_var1, diabetes_batch)
Example continuous field:
NumericColumn(key='num_lab_procedures', shape=(1,), default_value=(0,), dtype=tf.float64, normalizer_fn=functools.partial(<function normalize_numeric_with_zscore at 0x7fbae85598c0>, mean=43.09440469809606, std=19.930749701238398))

tf.Tensor(
[[ 0.        ]
 [ 1.        ]
 [ 0.15789473]
 [-0.21052632]
 [ 0.68421054]
 [ 0.7368421 ]
 [-1.2631578 ]
 [-0.6315789 ]
 [-1.6842105 ]
 [ 0.05263158]
 [-2.2105262 ]
 [ 0.7368421 ]
 [-1.1578947 ]
 [-1.2631578 ]
 [-0.47368422]
 [ 2.0526316 ]
 [ 0.68421054]
 [-0.6315789 ]
 [ 0.7368421 ]
 [-0.7894737 ]
 [-1.7894737 ]
 [-2.1052632 ]
 [-1.3684211 ]
 [-0.47368422]
 [-0.15789473]
 [-0.31578946]
 [ 0.42105263]
 [-0.8947368 ]
 [ 0.05263158]
 [-0.36842105]
 [ 0.8947368 ]
 [ 1.        ]
 [-0.31578946]
 [ 0.15789473]
 [-0.57894737]
 [ 1.2631578 ]
 [-0.47368422]
 [-1.8421053 ]
 [-0.8947368 ]
 [-0.42105263]
 [ 0.94736844]
 [ 1.3157895 ]
 [ 0.31578946]
 [ 0.31578946]
 [ 0.8947368 ]
 [ 1.0526316 ]
 [ 0.05263158]
 [-0.15789473]
 [ 1.4210526 ]
 [ 0.05263158]
 [ 0.21052632]
 [-0.5263158 ]
 [-0.31578946]
 [-1.7894737 ]
 [-1.7894737 ]
 [-2.2105262 ]
 [ 0.8947368 ]
 [ 2.0526316 ]
 [ 0.94736844]
 [ 2.4210527 ]
 [ 0.42105263]
 [-2.2105262 ]
 [-1.6842105 ]
 [-0.47368422]
 [-1.6842105 ]
 [-0.05263158]
 [-2.2105262 ]
 [-2.2105262 ]
 [-0.15789473]
 [-1.2105263 ]
 [-1.1052631 ]
 [-0.15789473]
 [ 1.2105263 ]
 [-2.2105262 ]
 [ 0.2631579 ]
 [ 1.6315789 ]
 [-0.31578946]
 [-0.2631579 ]
 [-0.57894737]
 [ 1.1578947 ]
 [-0.42105263]
 [ 1.0526316 ]
 [ 1.3684211 ]
 [ 0.36842105]
 [-0.36842105]
 [ 0.47368422]
 [ 2.2105262 ]
 [-1.1052631 ]
 [ 0.2631579 ]
 [ 0.10526316]
 [ 0.15789473]
 [-0.7894737 ]
 [-2.2105262 ]
 [-1.2631578 ]
 [-0.21052632]
 [-0.5263158 ]
 [ 0.5263158 ]
 [-0.15789473]
 [-1.6315789 ]
 [ 0.2631579 ]
 [ 0.10526316]
 [ 1.        ]
 [-0.05263158]
 [-0.05263158]
 [ 0.47368422]
 [-2.2105262 ]
 [ 0.31578946]
 [-2.1578948 ]
 [-0.05263158]
 [ 1.4736842 ]
 [-2.2105262 ]
 [-0.21052632]
 [-2.1578948 ]
 [ 1.8421053 ]
 [ 0.15789473]
 [ 0.8947368 ]
 [ 0.68421054]
 [ 0.05263158]
 [ 0.10526316]
 [-0.68421054]
 [-0.21052632]
 [-0.15789473]
 [ 0.42105263]
 [ 0.        ]
 [-0.7894737 ]
 [-0.84210527]
 [ 0.57894737]
 [-1.1578947 ]], shape=(128, 1), dtype=float32)

6. Build Deep Learning Regression Model with Sequential API and TF Probability Layers

Use DenseFeatures to combine features for the model

Now that we have prepared categorical and numerical features using Tensorflow's Feature Column API, we can combine them into a dense vector representation for the model. Below we will create this new input layer, which we will call 'claim_feature_layer'.

In [48]:
claim_feature_columns = tf_cat_col_list + tf_cont_col_list
claim_feature_layer = tf.keras.layers.DenseFeatures(claim_feature_columns)

Build Sequential API Model from DenseFeatures and TF Probability Layers

Below we have provided some boilerplate code for building a model that connects the Sequential API, DenseFeatures, and Tensorflow Probability layers into a deep learning model. There are many opportunities to further optimize and explore different architectures through benchmarking and testing approaches in various research papers, loss and evaluation metrics, learning curves, hyperparameter tuning, TF probability layers, etc. Feel free to modify and explore as you wish.

OPTIONAL: Come up with a more optimal neural network architecture and hyperparameters. Share the process in discovering the architecture and hyperparameters.

In [49]:
def build_sequential_model(feature_layer):
    model = tf.keras.Sequential([
        feature_layer,
        tf.keras.layers.Dropout(.1),
        tf.keras.layers.Dense(150, activation='relu'),
        tf.keras.layers.Dropout(.4),
        tf.keras.layers.Dense(75, activation='relu'),
        tfp.layers.DenseVariational(1+1, posterior_mean_field, prior_trainable),
        tfp.layers.DistributionLambda(
            lambda t:tfp.distributions.Normal(loc=t[..., :1],
                                             scale=1e-3 + tf.math.softplus(0.01 * t[...,1:])
                                             )
        ),
    ])
    return model

def build_diabetes_model(train_ds, val_ds,  feature_layer,  epochs=5, loss_metric='mse'):
    model = build_sequential_model(feature_layer)
    model.compile(optimizer='adam', loss=loss_metric, metrics=[loss_metric])
    early_stop = tf.keras.callbacks.EarlyStopping(monitor=loss_metric, patience=3)     
    history = model.fit(train_ds, validation_data=val_ds,
                        callbacks=[early_stop],
                        epochs=epochs)
    return model, history 
In [50]:
diabetes_model, history = build_diabetes_model(diabetes_train_ds, diabetes_val_ds,  claim_feature_layer,  epochs=25)
Train for 336 steps, validate for 112 steps
Epoch 1/25
336/336 [==============================] - 11s 34ms/step - loss: 26.1228 - mse: 26.0013 - val_loss: 17.7624 - val_mse: 17.2347
Epoch 2/25
336/336 [==============================] - 7s 22ms/step - loss: 16.9691 - mse: 16.3274 - val_loss: 12.7751 - val_mse: 11.9934
Epoch 3/25
336/336 [==============================] - 7s 21ms/step - loss: 12.9241 - mse: 12.0974 - val_loss: 11.3847 - val_mse: 10.8706
Epoch 4/25
336/336 [==============================] - 7s 22ms/step - loss: 12.1408 - mse: 11.4904 - val_loss: 10.1094 - val_mse: 9.3382
Epoch 5/25
336/336 [==============================] - 8s 24ms/step - loss: 10.7746 - mse: 10.1546 - val_loss: 10.1416 - val_mse: 9.4417
Epoch 6/25
336/336 [==============================] - 7s 21ms/step - loss: 10.3336 - mse: 9.7126 - val_loss: 8.9910 - val_mse: 8.4718
Epoch 7/25
336/336 [==============================] - 8s 23ms/step - loss: 9.5043 - mse: 9.0066 - val_loss: 8.6575 - val_mse: 8.2467
Epoch 8/25
336/336 [==============================] - 8s 25ms/step - loss: 9.2939 - mse: 8.8560 - val_loss: 9.5029 - val_mse: 9.0632
Epoch 9/25
336/336 [==============================] - 7s 21ms/step - loss: 9.1734 - mse: 8.6927 - val_loss: 9.0490 - val_mse: 8.5117
Epoch 10/25
336/336 [==============================] - 8s 23ms/step - loss: 8.9041 - mse: 8.5165 - val_loss: 7.7493 - val_mse: 7.2612
Epoch 11/25
336/336 [==============================] - 7s 21ms/step - loss: 8.6044 - mse: 8.2017 - val_loss: 7.6917 - val_mse: 7.1923
Epoch 12/25
336/336 [==============================] - 7s 19ms/step - loss: 8.6185 - mse: 8.1637 - val_loss: 7.6411 - val_mse: 7.2502
Epoch 13/25
336/336 [==============================] - 7s 20ms/step - loss: 8.4805 - mse: 8.1072 - val_loss: 7.5409 - val_mse: 7.1394
Epoch 14/25
336/336 [==============================] - 8s 23ms/step - loss: 8.2597 - mse: 7.7813 - val_loss: 7.8507 - val_mse: 7.4807
Epoch 15/25
336/336 [==============================] - 7s 22ms/step - loss: 8.1645 - mse: 7.7616 - val_loss: 8.0184 - val_mse: 7.6960
Epoch 16/25
336/336 [==============================] - 7s 22ms/step - loss: 8.1430 - mse: 7.7490 - val_loss: 7.8144 - val_mse: 7.2999
Epoch 17/25
336/336 [==============================] - 7s 22ms/step - loss: 7.9572 - mse: 7.4755 - val_loss: 7.2937 - val_mse: 6.9333
Epoch 18/25
336/336 [==============================] - 7s 20ms/step - loss: 7.9069 - mse: 7.4689 - val_loss: 6.9724 - val_mse: 6.5405
Epoch 19/25
336/336 [==============================] - 7s 21ms/step - loss: 7.6198 - mse: 7.2349 - val_loss: 7.4172 - val_mse: 7.0692
Epoch 20/25
336/336 [==============================] - 7s 20ms/step - loss: 7.6643 - mse: 7.3164 - val_loss: 7.4298 - val_mse: 6.9754
Epoch 21/25
336/336 [==============================] - 7s 20ms/step - loss: 7.7127 - mse: 7.3688 - val_loss: 7.6046 - val_mse: 7.2057
Epoch 22/25
336/336 [==============================] - 7s 20ms/step - loss: 7.8055 - mse: 7.3972 - val_loss: 7.4104 - val_mse: 7.0057

Show Model Uncertainty Range with TF Probability

Question 9: Now that we have trained a model with TF Probability layers, we can extract the mean and standard deviation for each prediction. Please fill in the answers for the m and s variables; the code for getting the predictions is provided for you below.

In [51]:
feature_list = student_categorical_col_list + student_numerical_col_list
diabetes_x_tst = dict(d_test[feature_list])
diabetes_yhat = diabetes_model(diabetes_x_tst)
preds = diabetes_model.predict(diabetes_test_ds)
In [52]:
from student_utils import get_mean_std_from_preds
m, s = get_mean_std_from_preds(diabetes_yhat)

Show Prediction Output

In [53]:
prob_outputs = {
    "pred": preds.flatten(),
    "actual_value": d_test['time_in_hospital'].values,
    "pred_mean": m.numpy().flatten(),
    "pred_std": s.numpy().flatten()
}
prob_output_df = pd.DataFrame(prob_outputs)
In [54]:
prob_output_df.head()
Out[54]:
pred actual_value pred_mean pred_std
0 6.794106 1.0 1.412620 0.692510
1 3.570023 6.0 5.323063 0.692250
2 3.086950 1.0 1.426101 0.692755
3 4.639755 2.0 2.423483 0.693143
4 3.206249 12.0 2.815374 0.694139

Convert Regression Output to Classification Output for Patient Selection

Question 10: Given the output predictions, convert them to a binary label indicating whether the patient meets the time criteria (HINT: use the mean prediction numpy array). The expected output is a numpy array with a 1 or 0 based on whether the prediction meets the criteria.

In [55]:
from student_utils import get_student_binary_prediction
student_binary_prediction = get_student_binary_prediction(prob_output_df, 'pred_mean')
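For reference, a minimal sketch of what get_student_binary_prediction might do, assuming the same 5-day inclusion criterion used for label_value; the threshold and column name are project choices, not a fixed API:

```python
import numpy as np
import pandas as pd

def get_student_binary_prediction(df, col):
    # Include a patient (1) if the mean predicted stay meets the 5-day
    # criterion, otherwise exclude them (0); mirrors the label threshold.
    return (df[col] >= 5).astype(int).values

demo_df = pd.DataFrame({'pred_mean': [1.41, 5.32, 1.43, 2.42, 2.82]})
get_student_binary_prediction(demo_df, 'pred_mean')  # array([0, 1, 0, 0, 0])
```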

Add Binary Prediction to Test Dataframe

Using the student_binary_prediction output, a numpy array of binary labels, we can add it to a dataframe to better visualize the results and to prepare the data for the Aequitas toolkit. The Aequitas toolkit requires that the predictions be mapped to a binary label (the 'score' field) alongside the actual value (the 'label_value' field).

In [56]:
def add_pred_to_test(test_df, pred_np, demo_col_list):
    for c in demo_col_list:
        test_df[c] = test_df[c].astype(str)
    test_df['score'] = pred_np
    test_df['label_value'] = test_df['time_in_hospital'].apply(lambda x: 1 if x >=5 else 0)
    return test_df

pred_test_df = add_pred_to_test(d_test, student_binary_prediction, ['race', 'gender'])
In [57]:
pred_test_df[['patient_nbr', 'gender', 'race', 'time_in_hospital', 'score', 'label_value']].head()
Out[57]:
patient_nbr gender race time_in_hospital score label_value
44903 35491419 Female Hispanic 1.0 0 0
50532 60219405 Female Caucasian 6.0 1 1
56450 42829641 Male Caucasian 1.0 0 0
45451 97296246 Female Caucasian 2.0 0 0
31632 36122508 Female Hispanic 12.0 0 1

Model Evaluation Metrics

In [58]:
from sklearn.metrics import confusion_matrix, roc_auc_score, f1_score
import seaborn as sns

Question 11: Now it is time to use the newly created binary labels in the 'pred_test_df' dataframe to evaluate the model with some common classification metrics. Please create a report summary of the performance of the model and be sure to give the ROC AUC, F1 score (weighted), and class precision and recall scores.

For the report, please be sure to include the following:

  • With a non-technical audience in mind, explain the precision-recall tradeoff in regard to how you have optimized your model.

  • What are some areas of improvement for future iterations?

Confusion matrix

In [59]:
# calculate the confusion matrix
tn, fp, fn, tp = confusion_matrix(y_true=pred_test_df['label_value'], y_pred=pred_test_df['score']).ravel()

# reorder the matrix so that true positives land in the top-left cell
cf_matrix = np.array([[tp, fp], [fn, tn]])

# plot
plt.figure(figsize=[13,8])
group_names = ['True Pos', 'False Pos', 'False Neg', 'True Neg']

group_counts = ["{0:0.0f}".format(value) for value in  cf_matrix.flatten()]
group_percentages = ["{0:.2%}".format(value) for value in cf_matrix.flatten()/np.sum(cf_matrix)]
labels = [f"{v1}\n{v2}\n{v3}" for v1, v2, v3 in zip(group_names,group_counts,group_percentages)]

labels = np.asarray(labels).reshape(2,2)
sns.set(font_scale=1.6)

sns.heatmap(pd.DataFrame(cf_matrix, columns=['Real >= 5', 'Real < 5'], index=['Pred >= 5', 'Pred < 5']), annot=labels, fmt="", cmap='Blues');

AUC

In [60]:
roc_auc_score(pred_test_df['label_value'], pred_test_df['score'])
Out[60]:
0.6738439660250706
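Because the 'score' column is already a hard 0/1 label rather than a probability, the ROC curve has a single operating point and the AUC reduces to balanced accuracy, (TPR + TNR) / 2. A quick check using the confusion-matrix counts implied by the outputs above (tp=2072, fp=478, fn=3107, tn=8646):

```python
# Balanced accuracy equals ROC AUC when predictions are hard 0/1 labels.
tp, fp, fn, tn = 2072, 478, 3107, 8646  # counts from the confusion matrix
tpr = tp / (tp + fn)  # sensitivity / recall
tnr = tn / (tn + fp)  # specificity
balanced_accuracy = (tpr + tnr) / 2  # ~0.673844, matching roc_auc_score
```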

F1 Score

The F1 score is the harmonic mean of the precision and recall:

$$F_{1} = \frac{tp}{tp + \frac{1}{2}\left ( fp+fn \right )}$$

The highest possible value of an F-score is 1.0, indicating perfect precision and recall, and the lowest possible value is 0, if either the precision or the recall is zero. The F1 score is also known as the Sørensen–Dice coefficient or Dice similarity coefficient (DSC).
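As a sanity check, the count form of F1 above is algebraically identical to the harmonic mean 2PR/(P+R); a small sketch using the positive-class counts from the confusion matrix above:

```python
tp, fp, fn = 2072, 478, 3107  # positive-class counts from the confusion matrix
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1_harmonic = 2 * precision * recall / (precision + recall)
f1_counts = tp / (tp + 0.5 * (fp + fn))
# Both give the positive-class F1 (~0.536); the weighted F1 averages
# per-class F1 scores weighted by class support, hence its higher value.
```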

In [61]:
f1_score(pred_test_df['label_value'], pred_test_df['score'], average='weighted')
Out[61]:
0.7225065825946758

Precision

How precise are the model's positive predictions?
E.g. if the model predicts True 100 times, how many of those predictions were actually correct?
If this is a high number, then you can trust the model as soon as it predicts a True value. This does not mean that the model always predicts True:
it may often (incorrectly) predict False,
but as soon as we observe a True prediction we can be fairly confident that it is indeed true.

In [62]:
tp / (tp + fp)
Out[62]:
0.8125490196078431

Recall

How many of the people with a true label have we been able to classify correctly?

In [63]:
tp / (tp + fn)
Out[63]:
0.4000772349874493

Precision-Recall Trade-Off

The idea behind the precision-recall trade-off is that raising the classification threshold tends to increase precision at the cost of recall, while lowering the threshold does the opposite.

E.g.

Threshold       = 0.5
Real Value      = [ 0,   1,    1,    1,   0,   0,   1,    0,    1,    1  ]
Predicted Value = [ 0.1, 0.45, 0.75, 0.6, 0.2, 0.4, 0.45, 0.33, 0.22, 0.9]
Predicted class = [ 0,   0,    1,    1,   0,   0,   0,    0,    0,    1  ]

TP = 3
FP = 0
FN = 3

--> precision : 3 / ( 3 + 0 ) = 100 %
    recall    : 3 / ( 3 + 3 ) = 50 %

From the above example, we notice that our fictional model has a precision of 100%:
if it predicts True, then the real value was indeed True.
But if we ask how many of the real True values it predicted correctly, the answer is only 50%.

Threshold       = 0.3
Real Value      = [ 0,   1,    1,    1,   0,   0,   1,    0,    1,    1  ]
Predicted Value = [ 0.1, 0.45, 0.75, 0.6, 0.2, 0.4, 0.45, 0.33, 0.22, 0.9]
Predicted class = [ 0,   1,    1,    1,   0,   1,   1,    1,    0,    1  ]

TP = 5
FP = 2
FN = 1

--> precision : 5 / ( 5 + 2 ) = 71 %
    recall    : 5 / ( 5 + 1 ) = 83 %

Lowering the threshold to 0.3 makes the model classify a sample as True much more readily,
which reduces the number of FNs
(a FN is a positive sample that was incorrectly classified as Negative)
--> and thus the Recall increases.

But as the Recall increases, the Precision decreases, because we now generate more FPs.
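The worked example above can be reproduced with a few lines of numpy; the arrays are exactly the fictional values used in the example:

```python
import numpy as np

y_true = np.array([0, 1, 1, 1, 0, 0, 1, 0, 1, 1])
y_prob = np.array([0.1, 0.45, 0.75, 0.6, 0.2, 0.4, 0.45, 0.33, 0.22, 0.9])

def precision_recall_at(threshold):
    # Threshold the probabilities, then count TP/FP/FN by hand.
    y_pred = (y_prob >= threshold).astype(int)
    tp = np.sum((y_pred == 1) & (y_true == 1))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    return tp / (tp + fp), tp / (tp + fn)

precision_recall_at(0.5)  # (1.0, 0.5)
precision_recall_at(0.3)  # (~0.71, ~0.83)
```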

Potential Improvements:

  • Experiment with:
    • The learning rate
    • An adaptive learning rate schedule
    • Dropout
    • The optimizer
    • The number of epochs
    • Alternative handling of missing (None) values
  • Find other features
    • Engineer new features
    • Make use of previous hospitalization info
  • Frame the task as a classification problem and use a cross-entropy loss
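The last idea, reframing the task as classification, can be sketched with plain numpy: threshold time_in_hospital into the binary inclusion label and score predicted probabilities with binary cross-entropy (the probabilities below are made-up values for illustration only):

```python
import numpy as np

# Binary target: 1 if the stay meets the 5-day criterion, else 0.
time_in_hospital = np.array([1.0, 6.0, 1.0, 2.0, 12.0])
y_true = (time_in_hospital >= 5).astype(float)

# Hypothetical predicted probabilities from a sigmoid output layer.
p = np.array([0.2, 0.8, 0.1, 0.3, 0.6])

# Binary cross-entropy: the loss a classification reframing would minimize.
bce = -np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))
```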

7. Evaluating Potential Model Biases with Aequitas Toolkit

Prepare Data For Aequitas Bias Toolkit

Using the gender and race fields, we will prepare the data for the Aequitas Toolkit.

In [64]:
# Aequitas
from aequitas.preprocessing import preprocess_input_df
from aequitas.group import Group
from aequitas.plotting import Plot
from aequitas.bias import Bias
from aequitas.fairness import Fairness

ae_subset_df = pred_test_df[['race', 'gender', 'score', 'label_value']]
ae_df, _ = preprocess_input_df(ae_subset_df)

g = Group()
xtab, _ = g.get_crosstabs(ae_df)
absolute_metrics = g.list_absolute_metrics(xtab)
clean_xtab = xtab.fillna(-1)
aqp = Plot()
b = Bias()
/opt/conda/lib/python3.7/site-packages/aequitas/group.py:143: SettingWithCopyWarning:


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy

model_id, score_thresholds 1 {'rank_abs': [2550]}

Reference Group Selection

Below we have chosen the reference group for our analysis but feel free to select another one.

In [65]:
# test reference group with Caucasian Male
bdf = b.get_disparity_predefined_groups(clean_xtab, 
                    original_df=ae_df, 
                    ref_groups_dict={'race':'Caucasian', 'gender':'Male'
                                     }, 
                    alpha=0.05, 
                    check_significance=False)


f = Fairness()
fdf = f.get_group_value_fairness(bdf)
get_disparity_predefined_group()
/opt/conda/lib/python3.7/site-packages/aequitas/bias.py:368: FutureWarning:

The pandas.np module is deprecated and will be removed from pandas in a future version. Import numpy directly instead

/opt/conda/lib/python3.7/site-packages/aequitas/fairness.py:32: FutureWarning:

The pandas.np module is deprecated and will be removed from pandas in a future version. Import numpy directly instead

/opt/conda/lib/python3.7/site-packages/aequitas/fairness.py:45: FutureWarning:

The pandas.np module is deprecated and will be removed from pandas in a future version. Import numpy directly instead

Race and Gender Bias Analysis for Patient Selection

Question 12: For the gender and race fields, please plot two metrics that are important for patient selection below and state whether there is a significant bias in your model across any of the groups along with justification for your statement.

In [66]:
# Plot two metrics

# Is there significant bias in your model for either race or gender?
absolute_metrics = g.list_absolute_metrics(xtab)
xtab[[col for col in xtab.columns if col not in absolute_metrics]]
Out[66]:
model_id score_threshold k attribute_name attribute_value pp pn fp fn tn tp group_label_pos group_label_neg group_size total_entities
0 1 binary 0/1 2550 race AfricanAmerican 458 2106 94 569 1537 364 933 1631 2564 14303
1 1 binary 0/1 2550 race Asian 19 81 4 17 64 15 32 68 100 14303
2 1 binary 0/1 2550 race Caucasian 1941 8746 361 2315 6431 1580 3895 6792 10687 14303
3 1 binary 0/1 2550 race Hispanic 38 268 7 70 198 31 101 205 306 14303
4 1 binary 0/1 2550 race Missing 61 320 9 90 230 52 142 239 381 14303
5 1 binary 0/1 2550 race Other 33 232 3 46 186 30 76 189 265 14303
6 1 binary 0/1 2550 gender Female 1366 6260 253 1710 4550 1113 2823 4803 7626 14303
7 1 binary 0/1 2550 gender Male 1184 5493 225 1397 4096 959 2356 4321 6677 14303
In [67]:
xtab[['attribute_name', 'attribute_value'] + absolute_metrics].round(2)
Out[67]:
attribute_name attribute_value tpr tnr for fdr fpr fnr npv precision ppr pprev prev
0 race AfricanAmerican 0.39 0.94 0.27 0.21 0.06 0.61 0.73 0.79 0.18 0.18 0.36
1 race Asian 0.47 0.94 0.21 0.21 0.06 0.53 0.79 0.79 0.01 0.19 0.32
2 race Caucasian 0.41 0.95 0.26 0.19 0.05 0.59 0.74 0.81 0.76 0.18 0.36
3 race Hispanic 0.31 0.97 0.26 0.18 0.03 0.69 0.74 0.82 0.01 0.12 0.33
4 race Missing 0.37 0.96 0.28 0.15 0.04 0.63 0.72 0.85 0.02 0.16 0.37
5 race Other 0.39 0.98 0.20 0.09 0.02 0.61 0.80 0.91 0.01 0.12 0.29
6 gender Female 0.39 0.95 0.27 0.19 0.05 0.61 0.73 0.81 0.54 0.18 0.37
7 gender Male 0.41 0.95 0.25 0.19 0.05 0.59 0.75 0.81 0.46 0.18 0.35
In [68]:
_ = aqp.plot_group_metric(xtab, group_metric= 'tpr')
In [69]:
_ = aqp.plot_group_metric(xtab, group_metric= 'tnr')
In [70]:
_ = aqp.plot_group_metric(xtab, group_metric= 'precision')
In [71]:
_ = aqp.plot_group_metric(xtab, group_metric= 'npv')
In [72]:
_ = aqp.plot_group_metric(xtab, group_metric= 'ppr')

Answer:


          REAL
       TP | FP
       -------   <-- predicted
       FN | TN

 - TPR: recall -> TP / (TP + FN)      # How many real positive values were classified positive
 - TNR: TN / (FP + TN)                # How many real negative values were classified negative

 - Precision: TP / (TP + FP)          # How many positive predictions were indeed correct
 - NPV: TN / (TN + FN)                # How many negative predictions were indeed correct

From the plots and tables above, we see no direct bias between Males and Females (gender).
Nor do we see a clear bias across race groups for the following metrics: TPR, TNR, PPV (precision), NPV.

However, we do see a large deviation in the PPR (Predicted Positive Ratio k Disparity) metric.
This is a point we may need to look at more closely in the future.
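These definitions can be checked against the cross-tab counts above; for example, the AfricanAmerican row (tp=364, fp=94, fn=569, tn=1537) reproduces the rounded metrics in the table:

```python
# Counts for the AfricanAmerican group from the Aequitas cross-tab above.
tp, fp, fn, tn = 364, 94, 569, 1537

tpr = tp / (tp + fn)        # ~0.39
tnr = tn / (fp + tn)        # ~0.94
precision = tp / (tp + fp)  # ~0.79
npv = tn / (tn + fn)        # ~0.73
```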

Fairness Analysis Example - Relative to a Reference Group

Question 13: Earlier we defined our reference group and then calculated disparity metrics relative to this grouping. Please provide a visualization of the fairness evaluation for this reference group and analyze whether there is disparity.

In [73]:
fpr_disparity = aqp.plot_disparity(bdf, group_metric='fpr_disparity', attribute_name='race')

Answer: All of the race groups have a broadly similar chance of being classified as a False Positive.
Other, followed by Hispanic, has the lowest chance of being falsely classified as positive (FPR disparities of roughly 0.3 and 0.64 respectively), i.e. their false positive rates are about 1.5-3x lower than the reference group's.
(Bear in mind that Other and Hispanic also represent the smallest numbers of patients.)
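These disparity values can be reproduced from the cross-tab counts: each group's FPR divided by the FPR of the Caucasian reference group. A quick check for the Other and Hispanic groups:

```python
# fp / (fp + tn) per race, using counts from the Aequitas cross-tab above.
fpr_caucasian = 361 / (361 + 6431)  # reference group
fpr_other = 3 / (3 + 186)
fpr_hispanic = 7 / (7 + 198)

other_disparity = fpr_other / fpr_caucasian        # ~0.30
hispanic_disparity = fpr_hispanic / fpr_caucasian  # ~0.64
```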

In [ ]: